@LeiWang1999 (Contributor) commented Sep 26, 2024

Introduce test cases for TL.

TODO Items

  • Pure TL GEMM with Swizzling
  • TL GEMM with Ladder Transformation
  • TL GEMM with Block Reduction
  • TL GEMM with Dequant

@LeiWang1999
Contributor Author

The syntax:

# Perform Matrix Multiplication
ptx_macro_generator.MMA(ptx_macro_generator, A_local, B_local, C_local)

is a bit ugly. Maybe we can find a way to avoid passing ptx_macro_generator twice.
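To make the concern concrete, here is a minimal plain-Python sketch of the two call styles. It is only an analogy: `ToyEmitter`, its `scale` field, and the element-wise multiply-accumulate are hypothetical stand-ins, not the TL macro generator or its real behavior. It assumes the duplication arises because `MMA` is looked up without being bound to the instance, so the instance has to be passed by hand.

```python
class ToyEmitter:
    """Hypothetical stand-in for a macro generator (not the real TL class)."""

    def __init__(self, scale=1):
        self.scale = scale

    @staticmethod
    def MMA(inst, a, b, c):
        # Explicit-instance style: the caller must pass the object itself,
        # which produces calls like gen.MMA(gen, a, b, c).
        for i in range(len(c)):
            c[i] += inst.scale * a[i] * b[i]
        return c

    def mma(self, a, b, c):
        # Bound-method wrapper: Python supplies `self`, so the call site
        # becomes gen.mma(a, b, c) with no repeated name.
        return ToyEmitter.MMA(self, a, b, c)


gen = ToyEmitter(scale=2)
explicit = gen.MMA(gen, [1, 2], [3, 4], [0, 0])  # instance named twice
bound = gen.mma([1, 2], [3, 4], [0, 0])          # instance named once
```

Both calls compute the same result; only the call site changes, which is the kind of cleanup the comment asks for.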

@LeiWang1999
Contributor Author

Optimized to:

for ki in T.serial(0, (block_K // micro_size_k)):

    # Load A into fragment
    mma_emitter.ldmatrix_a(
        A_local,
        A_shared,
        ki,
        thread_bindings=thread_bindings,
    )

    # Load B into fragment
    mma_emitter.ldmatrix_b(
        B_local,
        B_shared,
        ki,
        thread_bindings=thread_bindings,
    )

    # Perform Matrix Multiplication
    mma_emitter.mma(A_local, B_local, C_local)

# Perform STMatrix
mma_emitter.stmatrix(
    C_local,
    C_shared,
    thread_bindings=thread_bindings,
)
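For readers unfamiliar with the pattern, the loop above can be mimicked in plain Python. This is a rough sketch under stated assumptions, not the TL implementation: `tiled_matmul` is a hypothetical helper, nested lists stand in for shared memory and register fragments, and thread bindings and swizzling are ignored. It only shows the control flow: walk the K dimension in chunks of `micro_size_k`, load one slice of A and B per step, accumulate into a local C, then store the result once after the loop.

```python
def tiled_matmul(A_shared, B_shared, micro_size_k):
    """Hypothetical emulation of the ldmatrix / mma / stmatrix flow."""
    M = len(A_shared)
    block_K = len(A_shared[0])
    N = len(B_shared[0])
    assert block_K % micro_size_k == 0, "block_K must be a multiple of micro_size_k"

    # Accumulator fragment, zero-initialized.
    C_local = [[0] * N for _ in range(M)]

    for ki in range(block_K // micro_size_k):
        k0 = ki * micro_size_k

        # "ldmatrix_a": copy the ki-th K-slice of A into a local fragment.
        A_local = [row[k0:k0 + micro_size_k] for row in A_shared]

        # "ldmatrix_b": copy the ki-th K-slice of B into a local fragment.
        B_local = B_shared[k0:k0 + micro_size_k]

        # "mma": multiply the two fragments and accumulate into C_local.
        for i in range(M):
            for j in range(N):
                for k in range(micro_size_k):
                    C_local[i][j] += A_local[i][k] * B_local[k][j]

    # "stmatrix": write the accumulator back out once, after the K loop.
    C_shared = [row[:] for row in C_local]
    return C_shared


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
result = tiled_matmul(A, B, micro_size_k=2)
```

Storing C only once, outside the loop, mirrors why `stmatrix` sits after the `for ki` loop rather than inside it.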

@LeiWang1999 LeiWang1999 merged commit 150815b into microsoft:main Sep 26, 2024